perm filename HOWTO[4,KMC]2 blob sn#117252 filedate 1974-08-23 generic text, type T, neo UTF8
00100	
00200	
00300	
00400	HOW TO MEASURE IMPROVEMENT OF A SIMULATION MODEL
00500	   ALONG A DIMENSION OF LINGUISTIC COMPREHENSION
00600	
00700	COLBY, HILF, WITTNER, PARKISON, FAUGHT
00800	
00900	
01000		To measure improvement one needs a  scaled  dimension  and  a
01100	value   on   that   dimension  to  be  striven  for.  In  a  previous
01200	communication (Colby and Hilf, 1974) a method was described for using
01300	judges  to  rate  a  paranoid  simulation model's performance along a
01400	variety of dimensions. The  judges  consisted  of  randomly  selected
01500	psychiatrists  who  rated  transcripts  of  interviews  conducted  in
01600	natural language by other psychiatrists with  paranoid  patients  and
01700	with  versions of the model (PARRY1). The interviewers and the raters
01800	did not know that one of the interviewees was a  computer  simulation
01900	of paranoid processes.
02000		One  of the rated dimensions was linguistic noncomprehension.
02100	(The negation "non" was used to  keep  the  ratings  consistent  with
02200	other  ratings  being  made at the same time). A judge rated each I-O
02300	pair of an interview along this dimension on  a  scale  of  0-9.  The
02400	judges  proved to be reliable [Frank- concordance scores here on this
02500	dimension]. The mean score received by the patients was 0.74  and  by
02600	the  model  2.22.  The  difference  between  the  two mean ratings is
02700	significant at better than the 0.001 level.
02800		Close  study of the reasons for this difference revealed that
02900	the model recognized topics in the natural language input but did not
03000	sufficiently  recognize exactly what was being said about a topic. The
03100	pattern-recognition  processes  of  the  model  failed  to  pick   up
03200	sufficient  information  about  a  topic  to  give a reply indicating
03300	comprehension. The power of a pattern-matching approach in  language
03400	recognition  is  the  ability  to  ignore  as irrelevant both what it
03500	recognizes and what it does not recognize at all. Its  weakness  lies
03600	in  not  having  enough  patterns  to match the tremendous variety of
03700	expressions found in natural language dialogues.
03900		To improve the language-recognition processes of the model
04000	we designed several additional techniques which we shall only outline 
04100	here. A complete description of them can be found in Colby, Parkison
04200	and Faught (1974).
04300		In brief, the language-recognizing module of the current 
04400	paranoid model (PARRY2) progressively transforms the input until
04500	a pattern is achieved which completely or fuzzily matches a more
04600	abstract stored pattern. (See the flow diagram of Fig. 1). The
04700	input expression is first preprocessed by translating words and
04800	word groups (such as idioms) into internal synonyms which represent
04900	our names of word classes. Words not in the recognizer's dictionary
05000	are not included in the pattern being formed. Misspellings are
05100	corrected, groups of words are contracted into single words, and
05200	certain expansions are made (e.g. "dont" becomes "do not"). The
05300	pattern is then bracketed into shorter, more manageable units
05400	termed "segments". The resultant pattern is classified as "simple",
05500	containing no delimiters, or "complex", consisting of two or more
05600	simple patterns.
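
	The preprocessing steps above can be sketched roughly as follows.
This is a minimal illustration in Python, not the code of PARRY2. The
dictionary entries, the class names, the expansion and idiom tables, and
the use of a conjunction as a segment delimiter are all assumptions made
for the example. Misspelling correction is omitted.

```python
import re

# All tables below are invented for illustration; they are not PARRY2's.
EXPANSIONS = {"dont": "do not", "cant": "can not"}
IDIOMS = {("how", "come"): "WHY"}  # word group -> single class name
DICTIONARY = {"i": "I", "you": "YOU", "do": "DO", "not": "NOT",
              "like": "LIKE", "doctor": "DOCTOR", "doctors": "DOCTOR",
              "and": "AND"}
DELIMITERS = {"AND"}  # assumption: a conjunction separates segments


def preprocess(text):
    """Translate an input expression into a list of word-class names."""
    words = re.findall(r"[a-z]+", text.lower())
    expanded = []
    for word in words:
        # Expansions such as "dont" -> "do not".
        expanded.extend(EXPANSIONS.get(word, word).split())
    pattern = []
    i = 0
    while i < len(expanded):
        pair = tuple(expanded[i:i + 2])
        if pair in IDIOMS:
            # Contract a word group into a single class name.
            pattern.append(IDIOMS[pair])
            i += 2
        elif expanded[i] in DICTIONARY:
            pattern.append(DICTIONARY[expanded[i]])
            i += 1
        else:
            # Words not in the dictionary are left out of the pattern.
            i += 1
    return pattern


def segment(pattern):
    """Bracket a pattern into segments at the delimiters."""
    segments, current = [], []
    for element in pattern:
        if element in DELIMITERS:
            if current:
                segments.append(tuple(current))
            current = []
        else:
            current.append(element)
    if current:
        segments.append(tuple(current))
    return segments


def classify(segments):
    """A pattern with a single segment is simple; otherwise complex."""
    return "simple" if len(segments) <= 1 else "complex"
```

Under these assumed tables, an input such as "How come you dont like
doctors?" yields one segment and is classified as simple, while an input
containing "and" between two recognized word groups yields two segments
and is classified as complex.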
05700		The algorithm then attempts a complete match of the
05800	segments with stored simple patterns. When a match is found, the
05900	stored pattern points to the name of a response function in
06000	"memory" which decides what to do next. If a match is not found, a fuzzy
06100	match is tried by dropping elements in a segment one at a time
06200	and trying for a match each time. In the case of complex patterns
06300	this one-at-a-time dropping is carried out at the segment level. If
06400	these methods do not produce a match, a default condition obtains
06500	and the response module decides what to do.
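
	The matching strategy just described might be sketched as below.
This is an illustration under stated assumptions, not the PARRY2
algorithm itself: the stored patterns and response-function names are
hypothetical, and each fuzzy trial is assumed to remove a single unit
from the original pattern rather than dropping units cumulatively. The
same routine serves for simple patterns (the units are elements of a
segment) and for complex patterns (the units are whole segments).

```python
# Hypothetical stored patterns; each points to the name of a response function.
SIMPLE_PATTERNS = {
    ("YOU", "LIKE", "DOCTOR"): "answer_about_doctors",
    ("WHY", "YOU", "FEAR"): "explain_fear",
}
COMPLEX_PATTERNS = {
    (("YOU", "LIKE", "DOCTOR"), ("WHY",)): "ask_reason",
}


def match_with_dropping(units, stored):
    """Try a complete match, then a fuzzy match, then fall to the default.

    Returns (response_function_name, kind), where kind is "complete",
    "fuzzy", or "default". On default the name is None and the response
    module is left to decide what to do.
    """
    units = tuple(units)
    if units in stored:
        return stored[units], "complete"
    # Fuzzy match: drop one unit at a time and try for a match each time.
    for i in range(len(units)):
        reduced = units[:i] + units[i + 1:]
        if reduced in stored:
            return stored[reduced], "fuzzy"
    return None, "default"
```

For example, the segment ("YOU", "DO", "LIKE", "DOCTOR") has no complete
match in the assumed table, but dropping "DO" produces a fuzzy match with
the stored pattern ("YOU", "LIKE", "DOCTOR").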
06600		For this language-recognition strategy to be
06700	successful, a large number of words and word-combinations
06800	must be recognized and converted into patterns which match
06900	stored patterns. In  the first experiment to be described, there
07000	were 1900 dictionary entries and about 2200 patterns, 1700 being
07100	simple and 500 complex.
07200	
07300			EXPERIMENT 1
07400	
07500			METHOD
07600	
07700		Five clinicians interviewed both the old (PARRY1) and
07800	new (PARRY2) versions of the model without knowing which was which.
07900	All five agreed PARRY2 showed greater linguistic comprehension.
08000	To obtain a more precise estimate, 19 graduate students were
08100	paid to rate transcripts of these interviews. They rated each
08200	I-O pair of each interview along a dimension of "linguistic
08300	comprehension" ("Did the patient understand what the doctor
08400	said?") on a 0-9 scale.
08450	
08500			RESULTS
08600	
08700		In the 10 interviews there was a total of %%%% I-O pairs.
08800	On a 0-9 scale of linguistic comprehension, the mean rating of
08900	PARRY1 was 5.256 and the mean rating of PARRY2 was 5.483. This
09000	difference is significant at the 0.05 level (t=1.0935, one-tailed
09100	test).
09200		These raters also rated transcripts of the original
09300	eight interviews conducted by psychiatrists with PARRY1 and
09400	with paranoid patients. PARRY1 received a mean rating of 5.19 and
09500	the patients 7.42. The difference is significant at the 0.001
09600	level. This confirms the original test using psychiatrists
09700	as raters. (Frank---how does it?)
09800		The student raters gave PARRY1 in the original interviews
09900	a mean rating of 5.19 and a mean rating of 5.26 in the experiment
10000	under discussion. The difference is not statistically significant
10100	(SD(difference)=0.1497, t=0.45, p<0.80). We can conclude the
10200	student raters are reliable and PARRY1 generates reliable
10300	ratings from two groups of raters.
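
	The arithmetic behind this comparison can be retraced as follows,
assuming that the statistic is simply the difference between the two
means divided by the reported SD(difference). That form is our reading
of the figures, not something the text states, and the function name is
introduced only for this sketch.

```python
def t_statistic(mean_a, mean_b, sd_difference):
    """Difference of two means divided by the SD of the difference."""
    return (mean_a - mean_b) / sd_difference


# Means and SD(difference) as reported in the text.
t_rounded = t_statistic(5.26, 5.19, 0.1497)   # two-decimal means
t_precise = t_statistic(5.256, 5.19, 0.1497)  # PARRY1 mean given as 5.256

print(round(t_rounded, 2), round(t_precise, 2))
```

With the means as rounded in the text the quotient is about 0.47, and
with the three-decimal PARRY1 mean it is about 0.44. The reported value
of 0.45 lies between the two, which is what rounding of the means would
be expected to produce.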
10400	
10500			DISCUSSION
10600	
10700	
10800		The improvement (movement towards the ratings received by
10900	patients) of PARRY2 over PARRY1 along the dimension of linguistic
11000	comprehension is statistically significant. However, PARRY2's rating
11100	of 5.48 is still distant from the rating of 7.42 received by the
11200	patients. How close should a simulation model come to its natural
11300	counterpart? Everybody knows that nobody knows. Perhaps we have
11400	reached the limit of approximation. Intuitively it seemed the model
11500	should be able to do better if we could pinpoint its most serious
11600	inadequacies.
11700		We looked at each I-O pair which received a mean rating
11800	of 5.0 or less. There were %%%% such cases. In %%% of these cases
11900	the pattern was recognized but, due to our own errors, the pointers
12000	pointed to the wrong response functions. In the %%% remaining cases,
12100	the pattern was not recognized. We corrected the pointers and then  
12200	repeated the experiment using five different clinicians who interviewed
12300	PARRY1 and PARRY2.
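
	The selection of poorly rated I-O pairs described above amounts
to a simple filter on the mean rating. A sketch, in which the pair
identifiers and the ratings are invented purely to show the mechanics
and are not data from the experiment:

```python
from statistics import mean


def low_rated_pairs(ratings_by_pair, threshold=5.0):
    """Return the I-O pairs whose mean rating is at or below the threshold."""
    return [pair_id for pair_id, ratings in ratings_by_pair.items()
            if mean(ratings) <= threshold]


# Invented ratings from three raters for three I-O pairs.
example = {"pair_1": [7, 8, 6], "pair_2": [4, 5, 3], "pair_3": [5, 5, 5]}
print(low_rated_pairs(example))  # ['pair_2', 'pair_3']
```

Each pair selected in this way would then be inspected by hand to see
whether the pattern was recognized but pointed to the wrong response
function, or was not recognized at all.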
12400	
12500	                   EXPERIMENT 2